Abstract

The semijoin algebra is the variant of the relational algebra obtained 
by replacing the join operator by the semijoin operator. We discuss some 
interesting connections between the semijoin algebra and the guarded fragment
of first-order logic. We also provide an Ehrenfeucht-Fraïssé game
characterizing the discerning power of the semijoin algebra. This game
gives a method for showing that certain queries are not expressible in the 
semijoin algebra. 



1 Introduction 

Semijoins are very important in the field of database query processing. While 
computing project-join queries in general is NP-complete in the size of the query 
and the database, this can be done in polynomial time when the database schema 
is acyclic, a property known to be equivalent to the existence of a semijoin
program [3]. Semijoins are often used as part of a query pre-processing phase
where dangling tuples are eliminated. Another interesting property is that the 
size of a relation resulting from a semijoin is always linear in the size of the 
input. Therefore, a query processor will try to use semijoins as often as possible 
when generating a query plan for a given query (a technique known as "pushing
projections" [7]). Also in distributed query processing, semijoins have great
importance, because when a database is distributed across several sites, they 
can help avoid the shipment of many unneeded tuples. 

Because of their practical importance, we would like to have a clear knowledge
of the capabilities and the limitations of semijoins. For example, Bernstein, Chiu
and Goodman have characterized the conjunctive queries computable by
semijoin programs. In this paper, we consider the much larger class of queries
computable in the variant of the relational algebra obtained by replacing the 
join operator by the semijoin operator. We call this the semijoin algebra (SA). 
A join of two relations combines all tuples satisfying a given condition, called 
the join condition. A semijoin differs from a join in the sense that it selects
only those tuples in the first relation that participate in the join. The semijoin 
algebra is a fragment of the relational algebra, which is known to be equivalent 
to first-order logic (called relational calculus in database theory P]). 

Interestingly, there is a fragment of first-order logic very similar to the semi- 
join algebra: it is the so called "guarded fragment" (GF) El El EH] , which has 
been studied in the field of modal logic. This is interesting because the motiva- 
tions to study this fragment came purely from the field of logic and had noth- 
ing to do with database query processing. Indeed, the purpose was to extend 
propositional modal logic to the predicate level, retaining the good properties of 
modal logic, such as the finite model property. An important tool in the study 
of the expressive power of the GF is the notion of "guarded bisimulation", which
provides a characterization of the discerning power of the GF. 

We will show that when we allow only equalities to appear in the semijoin 
conditions, the semijoin algebra has essentially the same expressive power as 
the guarded fragment. When also nonequalities or other predicates are allowed, 
the semijoin algebra becomes more powerful. We will define a generalization of 
guarded bisimulation, in the form of an Ehrenfeucht-Fraïssé game, that characterizes
the discerning power of the semijoin algebra. We will use this tool to
show that certain queries are not expressible in SA. 

2 Preliminaries 

In this section, we give formal definitions of the semijoin algebra and the guarded 
fragment. 

From the outset, we assume a universe $U$ of basic data values, over which
a number of predicates are defined. These predicates can be combined into
quantifier-free first-order formulas, which are used in selection and semijoin
conditions. The names of these predicates and their arities are collected in the
vocabulary $\Omega$. The equality predicate ($=$) is always in $\Omega$. A database schema
is a finite set $S$ of relation names, each associated with its arity. $S$ is disjoint
from $\Omega$. A database $D$ over $S$ is an assignment of a finite relation $D(R) \subseteq U^n$
to each $R \in S$, where $n$ is the arity of $R$.

Proviso. When $\varphi$ stands for a first-order formula, then $\varphi(x_1, \ldots, x_k)$ indicates
that all free variables of $\varphi$ are among $x_1, \ldots, x_k$.

First, we define the Semijoin Algebra. 

Definition 1 (Semijoin algebra, SA). Let $S$ be a database schema. The syntax
and semantics of the semijoin algebra are inductively defined as follows:

1. Each relation $R \in S$ is a semijoin algebra expression.

2. If $E_1, E_2 \in \mathrm{SA}$ have arity $n$, then $E_1 \cup E_2$ and $E_1 - E_2$ also belong to SA and
are of arity $n$.

3. If $E \in \mathrm{SA}$ has arity $n$ and $X \subseteq \{1, \ldots, n\}$, then $\pi_X(E)$ belongs to SA and
is of arity $\#X$.

4. If $E_1, E_2 \in \mathrm{SA}$ have arities $n$ and $m$, respectively, and $\theta_1(x_1, \ldots, x_n)$ and
$\theta_2(x_1, \ldots, x_n, y_1, \ldots, y_m)$ are quantifier-free formulas over $\Omega$, then
$\sigma_{\theta_1}(E_1)$ and $E_1 \ltimes_{\theta_2} E_2$ also belong to SA and are of arity $n$.

The semantics of the projection, the selection and the semijoin operator are
as follows:
$\pi_X(E) := \{(a_i)_{i \in X} \mid (a_1, \ldots, a_n) \in E\}$,
$\sigma_{\theta_1}(E) := \{(a_1, \ldots, a_n) \in E \mid \theta_1(a_1, \ldots, a_n) \text{ holds}\}$,
$E_1 \ltimes_{\theta_2} E_2 := \{(a_1, \ldots, a_n) \in E_1 \mid \exists (b_1, \ldots, b_m) \in E_2,\ \theta_2(a_1, \ldots, a_n, b_1, \ldots, b_m) \text{ holds}\}$.
The semantics of the other operators are well known.
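
The operators of Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the paper: relations are modelled as sets of tuples, conditions as Python predicates, and the function names (`project`, `select`, `semijoin`, and so on) are our own choices. Positions in `project` are 1-based, as in the definition, and the output keeps them in increasing order.

```python
def union(e1, e2):
    """E1 ∪ E2 for two relations of the same arity."""
    return e1 | e2


def difference(e1, e2):
    """E1 - E2 for two relations of the same arity."""
    return e1 - e2


def project(e, positions):
    """pi_X(E): keep the components at the 1-based positions in X."""
    xs = sorted(positions)
    return {tuple(t[i - 1] for i in xs) for t in e}


def select(e, theta):
    """sigma_theta(E): keep the tuples a for which theta(a) holds."""
    return {a for a in e if theta(a)}


def semijoin(e1, e2, theta):
    """E1 semijoin_theta E2: tuples a of E1 with a witness b in E2 such that theta(a, b)."""
    return {a for a in e1 if any(theta(a, b) for b in e2)}


# Hypothetical example data: a binary relation R and a unary relation S.
R = {(1, 2), (2, 3), (4, 4)}
S = {(2,), (4,)}

# R semijoin_{x2 = y1} S keeps the R-tuples whose second component occurs in S.
kept = semijoin(R, S, lambda a, b: a[1] == b[0])
print(sorted(kept))
```

Note that the result of `semijoin` is always a subset of its first argument, which is the linear-size property mentioned in the introduction.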

Now, we recall the definition of the guarded fragment. 
Definition 2 (Guarded fragment, GF). Let S be a database schema. 

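1. All quantifier-free first-order formulas over $S$ are formulas of GF.

2. If $\varphi$ and $\psi$ are formulas of GF, then so are $\neg\varphi$, $\varphi \vee \psi$, $\varphi \wedge \psi$, $\varphi \rightarrow \psi$ and
$\varphi \leftrightarrow \psi$.

3. If $\varphi(\bar{x}, \bar{y})$ is a formula of GF and $\alpha(\bar{x}, \bar{y})$ is an atomic formula such that
all free variables of $\varphi$ actually occur in $\alpha$, then $\exists \bar{y}(\alpha(\bar{x}, \bar{y}) \wedge \varphi(\bar{x}, \bar{y}))$ is
a formula of GF.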

As the guarded fragment is a fragment of first-order logic, the semantics of GF 
is that of first-order logic, interpreted over the active domain of the database pQ. 

3 Semijoin algebra versus guarded fragment 

In this section, $\Omega = \{=\}$ consists only of the equality predicate. Suppose furthermore
that we only allow conjunctions of equalities to be used in the semijoin
conditions; selection conditions can be arbitrary quantifier-free formulas over $\Omega$.
We will denote the semijoin algebra with this restriction on the semijoin conditions
by $\mathrm{SA}^{=}$. Before we prove that $\mathrm{SA}^{=}$ is subsumed by GF, we need a
lemma.

Lemma 3. For every $\mathrm{SA}^{=}$ expression $E$ of arity $k$, for every database $A$ and for
every tuple $\bar{z} = (z_1, \ldots, z_k)$ in $E(A)$, there exists $R$ in $S$, an injective function
$f : \{1, \ldots, k\} \to \{1, \ldots, \mathrm{arity}(R)\}$, and a tuple $\bar{t}$ in $A(R)$ such that $\bigwedge_{i=1}^{k} z_i = t_{f(i)}$.

Proof. By structural induction on expression E. □ 

Theorem 4. For every $\mathrm{SA}^{=}$ expression $E$ of arity $k$, there exists a GF formula
$\varphi_E$ such that for every database $D$, $E(D) = \{\bar{d} \mid D \models \varphi_E(\bar{d})\}$.

Proof. The proof is by structural induction on E. 

• if $E$ is $R$, then $\varphi_E(x_1, \ldots, x_k) := R(x_1, \ldots, x_k)$.

• if $E$ is $E_1 \cup E_2$, then $\varphi_E(x_1, \ldots, x_k) := \varphi_{E_1}(x_1, \ldots, x_k) \vee \varphi_{E_2}(x_1, \ldots, x_k)$.

• if $E$ is $E_1 - E_2$, then $\varphi_E(x_1, \ldots, x_k) := \varphi_{E_1}(x_1, \ldots, x_k) \wedge \neg\varphi_{E_2}(x_1, \ldots, x_k)$.

• if $E$ is $\sigma_{\theta}(E_1)$, then $\varphi_E(x_1, \ldots, x_k) := \varphi_{E_1}(x_1, \ldots, x_k) \wedge \theta(x_1, \ldots, x_k)$.

• if $E$ is $\pi_{i_1, \ldots, i_k}(E_1)$ with $E_1$ of arity $n$, then, by induction, $\varphi_{E_1}(z_1, \ldots, z_n)$
defines all tuples in $E_1(D)$. By Lemma 3, $\varphi_{E_1}(\bar{z})$ is equivalent to the
formula obtained by replacing in $\psi :=$

$$\bigvee_{R \in S} \ \bigvee_{f : \{1, \ldots, n\} \to \{1, \ldots, \mathrm{arity}(R)\}} \exists (t_i)_{i \in Q} \bigl( R(\bar{t}) \wedge \varphi_{E_1}(t_{f(1)}, \ldots, t_{f(n)}) \bigr)$$

each $t_{f(i)}$ by $z_i$, $i = 1, \ldots, n$. In this formula, $Q$ is a shorthand for
the set $\{1, \ldots, \mathrm{arity}(R)\} - f(\{1, \ldots, n\})$. Formula $\varphi_E$ should now only
select components $i_1, \ldots, i_k$ out of this formula. To this end, we modify
$\psi$ such that in each disjunct it quantifies over $(t_j)_{j \in Q'}$ with
$Q' = \{1, \ldots, \mathrm{arity}(R)\} - f(\{i_1, \ldots, i_k\})$ and in each disjunct $t_{f(i_l)}$ is replaced
by $x_l$, $l = 1, \ldots, k$. Now $\varphi_E(x_1, \ldots, x_k)$ is obtained.

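• if $E$ is $E_1 \ltimes_{\theta} E_2$ with $\theta = \bigwedge_{l=1}^{s} x_{i_l} = y_{j_l}$, and $E_2$ of arity $n$, then, by induction,
$\varphi_{E_1}(x_1, \ldots, x_k)$ and $\varphi_{E_2}(z_1, \ldots, z_n)$ define all tuples in $E_1(D)$ and
$E_2(D)$ respectively. By Lemma 3, $\varphi_E(x_1, \ldots, x_k)$ is obtained by replacing
in formula $\chi :=$

$$\varphi_{E_1}(x_1, \ldots, x_k) \wedge \bigvee_{R \in S} \ \bigvee_{f : \{1, \ldots, n\} \to \{1, \ldots, \mathrm{arity}(R)\}} \exists (t_i)_{i \in Q''} \bigl( R(\bar{t}) \wedge \varphi_{E_2}(t_{f(1)}, \ldots, t_{f(n)}) \bigr)$$

each $t_{f(j_l)}$ by $x_{i_l}$, $l = 1, \ldots, s$. Note that condition $\theta$ is enforced by
repetition of variables $x_{i_l}$. In this formula, $Q'' = \{1, \ldots, \mathrm{arity}(R)\} - f(\{j_1, \ldots, j_s\})$.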

□ 
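
As a small sanity check of the semijoin case, the following Python sketch compares the expression $E = R \ltimes_{x_2 = y_1} S$ with a guarded formula for it. It is a simplified instance under our own assumptions, not the general construction: $R$ and $S$ are hypothetical binary relations, only the disjunct guarded by $S$ with $f$ the identity is kept, and the formula is evaluated by brute force over the active domain.

```python
def semijoin(e1, e2, theta):
    """E1 semijoin_theta E2 on relations modelled as sets of tuples."""
    return {a for a in e1 if any(theta(a, b) for b in e2)}


def eval_expression(R, S):
    """E = R semijoin_{x2 = y1} S."""
    return semijoin(R, S, lambda a, b: a[1] == b[0])


def eval_formula(R, S):
    """phi_E(x1, x2) = R(x1, x2) AND EXISTS t2 (S(x2, t2)).

    The quantifier is guarded by the atom S(x2, t2); variables range over
    the active domain of the database.
    """
    adom = {v for t in R | S for v in t}
    return {
        (x1, x2)
        for x1 in adom
        for x2 in adom
        if (x1, x2) in R and any((x2, t2) in S for t2 in adom)
    }


# Hypothetical example database.
R = {(1, 2), (2, 3), (3, 3)}
S = {(2, 5)}
print(eval_expression(R, S) == eval_formula(R, S))
```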

By the decidability of GF, we obtain:

Corollary 5. Satisfiability of $\mathrm{SA}^{=}$ expressions is decidable.

Here, satisfiability of SA expressions always means finite satisfiability,
because a database is finite by definition.

The literal converse statement of Theorem 4 is not true, because the guarded
fragment contains all quantifier-free first-order formulas, so that one can express
arbitrary cartesian products in it, such as $\{(x, y) \mid S_1(x) \wedge S_2(y)\}$. Cartesian
products, of course, cannot be expressed in the semijoin algebra. Nevertheless,
the result of any GF query restricted to a single relation by a semijoin is always
expressible in $\mathrm{SA}^{=}$:

Theorem 6. For every GF formula $\varphi(x_1, \ldots, x_k)$, for every relation $R$ (with
arity $n$), for every injective function $f : \{1, \ldots, k\} \to \{1, \ldots, n\}$, the query
$\{\bar{x} \mid \varphi(\bar{x})\} \ltimes_{\theta} R$, in which $\theta$ is $\bigwedge_{i=1}^{k} x_i = y_{f(i)}$, is expressible in $\mathrm{SA}^{=}$.
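Proof. By structural induction on $\varphi$, we construct the desired semijoin expression
$E_{\varphi}^{R,f}$.

• if $\varphi(x_1, \ldots, x_k)$ is $T(x_{i_1}, \ldots, x_{i_l})$, then $E_{\varphi}^{R,f} := \pi_{f(1), \ldots, f(k)}(R) \ltimes_{\theta} T$, where
$\theta$ is $(x_{i_1} = y_1) \wedge (x_{i_2} = y_2) \wedge \ldots \wedge (x_{i_l} = y_l)$;

• if $\varphi(x_1, \ldots, x_k)$ is $(x_i = x_j)$, then $E_{\varphi}^{R,f} := \sigma_{x_i = x_j}(\pi_{f(1), \ldots, f(k)}(R))$;

• if $\varphi(x_1, \ldots, x_k)$ is $\psi(x_1, \ldots, x_k) \vee \xi(x_1, \ldots, x_k)$, then $E_{\varphi}^{R,f} := E_{\psi}^{R,f} \cup E_{\xi}^{R,f}$;

• if $\varphi(x_1, \ldots, x_k)$ is $\neg\psi(x_1, \ldots, x_k)$, then $E_{\varphi}^{R,f} := \pi_{f(1), \ldots, f(k)}(R) - E_{\psi}^{R,f}$;

• suppose $\varphi(x_1, \ldots, x_k)$ is $\exists \bar{z}(\alpha(\bar{x}, \bar{z}) \wedge \psi(\bar{x}, \bar{z}))$, where $\alpha$ is atomic with
relation name $T$. Let $x_{i_1}, \ldots, x_{i_r}$ be the different occurrences of variables
among $x_1, \ldots, x_k$ in $\alpha$. Now, $E_{\varphi}^{R,f} := \pi_{f(1), \ldots, f(k)}(R) \ltimes_{\theta} E_{\psi}^{T,g}$, where $\theta$ is
$(x_{i_1} = y_1) \wedge (x_{i_2} = y_2) \wedge \ldots \wedge (x_{i_r} = y_r)$ and $g$ is the function that maps
$j \in \{1, \ldots, r\}$ to the position of $x_{i_j}$ in $\alpha$ and that maps $j \in \{r+1, \ldots, r+l\}$
to the position of $z_{j-r}$ in $\alpha$.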

□ 

Taking $k = 0$ and $R$ equal to any nonempty relation in the above theorem,
we obtain:

Corollary 7. Over the class of nonempty databases, GF sentences and 0-ary
$\mathrm{SA}^{=}$ expressions have equal expressive power.

Here, a database is said to be empty if all its relations are empty. 

Let us now allow arbitrary semijoin conditions (still over equality only).
Specifically, nonequalities are now allowed. We will denote the semijoin algebra
over $\Omega = \{=\}$ by $\mathrm{SA}^{\neq}$. Then, GF no longer subsumes $\mathrm{SA}^{\neq}$. A counterexample
is the query that asks whether there are at least two distinct elements in a single
unary relation $S$. This is expressible in $\mathrm{SA}^{\neq}$ as $S \ltimes_{x_1 \neq y_1} S$, but is not expressible
in GF. Indeed, a set with a single element is guarded bisimilar to a set with two
elements |21ITU]. 

Unfortunately, these nonequalities in semijoin conditions make satisfiability
undecidable.

Theorem 8. Satisfiability of $\mathrm{SA}^{\neq}$ expressions is undecidable.

Proof. Grädel [5, Theorem 5.8] shows that GF with functionality statements
in the form of functional[D], saying that the binary relation $D$ is the graph
of a partial function, is a conservative reduction class. Since functional[D] is
expressible in $\mathrm{SA}^{\neq}$ as $D \ltimes_{x_1 = y_1 \wedge x_2 \neq y_2} D = \emptyset$, it follows that satisfiability
of $\mathrm{SA}^{\neq}$ expressions is undecidable.

□ 
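
The two $\mathrm{SA}^{\neq}$ expressions used above, $S \ltimes_{x_1 \neq y_1} S$ and $D \ltimes_{x_1 = y_1 \wedge x_2 \neq y_2} D = \emptyset$, can be tried out with a minimal Python sketch. The helper names `at_least_two` and `is_functional` are our own, and relations are modelled as sets of tuples.

```python
def semijoin(e1, e2, theta):
    """E1 semijoin_theta E2 on relations modelled as sets of tuples."""
    return {a for a in e1 if any(theta(a, b) for b in e2)}


def at_least_two(S):
    """S semijoin_{x1 != y1} S is nonempty iff S has two distinct elements."""
    return bool(semijoin(S, S, lambda a, b: a[0] != b[0]))


def is_functional(D):
    """functional[D]: D semijoin_{x1 = y1 and x2 != y2} D is empty."""
    return not semijoin(D, D, lambda a, b: a[0] == b[0] and a[1] != b[1])


print(at_least_two({(1,)}), at_least_two({(1,), (2,)}))
print(is_functional({(1, 2), (2, 2)}), is_functional({(1, 2), (1, 3)}))
```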

In the next section, we will generalize guarded bisimulation to the semijoin
algebra, with arbitrary quantifier-free formulas over $\Omega$ as semijoin conditions.
4 An Ehrenfeucht-Fraïssé game for the semijoin
algebra

In this section, we describe an Ehrenfeucht-Fraïssé game that characterizes the
discerning power of the semijoin algebra.

Let $A$ and $B$ be two databases over the same schema $S$. The semijoin
game on these databases is played by two players, called the spoiler and the
duplicator. They, in turn, choose tuples from the tuple spaces $T_A$ and $T_B$, which
are defined as follows: $T_A := \bigcup_{R \in S} \bigcup \{\pi_X(A(R)) \mid X \subseteq \{1, \ldots, \mathrm{arity}(R)\}\}$, and
$T_B$ is defined analogously. So, the players can pick tuples from the databases
and projections of these.
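
The tuple space can be computed directly from its definition. The sketch below is illustrative only: it assumes a database is a Python dict from relation names to sets of tuples, and the function name `tuple_space` is our own.

```python
from itertools import combinations


def tuple_space(db):
    """T_A: every tuple of every relation, together with all its projections."""
    space = set()
    for rel in db.values():
        for u in rel:
            # Every subset X of positions, including the empty set.
            for size in range(len(u) + 1):
                for xs in combinations(range(len(u)), size):
                    space.add(tuple(u[i] for i in xs))
    return space


# Hypothetical database with one binary relation holding one tuple.
A = {"R": {(1, 2)}}
print(sorted(tuple_space(A)))
```

Taking $X = \emptyset$ yields the empty tuple $()$ whenever some relation is nonempty, which is why configurations such as $(A, (); B, ())$ are available as starting points.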

At each stage in the game, there is a tuple $a \in T_A$ and a tuple $b \in T_B$. We
will denote such a configuration by $(A, a; B, b)$. The conditions for the duplicator
to win the game with 0 rounds are:

1. $\forall R \in S, \forall X \subseteq \{1, \ldots, \mathrm{arity}(R)\} : a \in \pi_X(A(R)) \Leftrightarrow b \in \pi_X(B(R))$

2. for every atomic formula (equivalently, for every quantifier-free formula)
$\theta$ over $\Omega$, $\theta(a)$ holds iff $\theta(b)$ holds.

In the game with $m \geq 1$ rounds, the spoiler will be the first one to make a
move. Therefore, he first chooses a database ($A$ or $B$). Then he picks a tuple
in $T_A$ or in $T_B$ respectively. The duplicator then has to make an "analogous"
move in the other tuple space. When the duplicator can keep this up for $m$ rounds,
no matter what moves the spoiler makes, we say that the duplicator wins the
$m$-round semijoin game on $A$ and $B$. The "analogous" moves for the duplicator
are formally defined as legal answers in the next definition.

Definition 9 (legal answer). Suppose that at a certain moment in the semijoin
game, the configuration is $(A, a; B, b)$. If the spoiler takes a tuple $c \in T_A$ in
his next move, then the tuples $d \in T_B$ for which the following conditions hold
are legal answers for the duplicator:

1. $\forall R \in S, \forall X \subseteq \{1, \ldots, \mathrm{arity}(R)\} : d \in \pi_X(B(R)) \Leftrightarrow c \in \pi_X(A(R))$

2. for every atomic formula $\theta$ over $\Omega$, $\theta(a, c)$ holds iff $\theta(b, d)$ holds.

If the spoiler takes a tuple $d \in T_B$, the legal answers $c \in T_A$ are defined identically.
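
For small databases, the game can be decided by exhaustive search. The following Python sketch is one possible reading of the 0-round conditions and of Definition 9, under assumptions of our own: $\Omega = \{=\}$ only, databases are dicts from relation names to sets of tuples, and all function names are hypothetical. The running time is exponential in the number of rounds, so it is meant for toy examples only.

```python
from itertools import combinations


def tuple_space(db):
    """All tuples of db together with all their projections."""
    space = set()
    for rel in db.values():
        for u in rel:
            for size in range(len(u) + 1):
                for xs in combinations(range(len(u)), size):
                    space.add(tuple(u[i] for i in xs))
    return space


def profile(db, t):
    """Condition 1: the set of pairs (R, X) such that t is in pi_X(db(R))."""
    found = set()
    for name, rel in db.items():
        for u in rel:
            for xs in combinations(range(len(u)), len(t)):
                if tuple(u[i] for i in xs) == t:
                    found.add((name, xs))
    return frozenset(found)


def eq_type(t):
    """Atomic type over Omega = {=} (an assumption): arity plus equality pattern."""
    n = len(t)
    return n, tuple(t[i] == t[j] for i in range(n) for j in range(i + 1, n))


def legal_answers(src, cur_src, move, dst, cur_dst):
    """Tuples of the tuple space of dst that legally answer `move` (Definition 9)."""
    return [
        d for d in tuple_space(dst)
        if len(d) == len(move)
        and profile(dst, d) == profile(src, move)
        and eq_type(cur_src + move) == eq_type(cur_dst + d)
    ]


def duplicator_wins(A, a, B, b, m):
    """True iff the duplicator wins the m-round game from (A, a; B, b)."""
    if profile(A, a) != profile(B, b) or eq_type(a) != eq_type(b):
        return False  # the 0-round conditions already fail
    if m == 0:
        return True
    spoiler_in_A = all(
        any(duplicator_wins(A, c, B, d, m - 1)
            for d in legal_answers(A, a, c, B, b))
        for c in tuple_space(A))
    spoiler_in_B = all(
        any(duplicator_wins(A, c, B, d, m - 1)
            for c in legal_answers(B, b, d, A, a))
        for d in tuple_space(B))
    return spoiler_in_A and spoiler_in_B


# Hypothetical example: a one-element set versus a two-element set.
A = {"S": {(1,)}}
B = {"S": {(1,), (2,)}}
print(duplicator_wins(A, (), B, (), 1), duplicator_wins(A, (), B, (), 2))
```

On this example the duplicator survives one round but not two: once the configuration holds one element on each side, the spoiler picks the other element of the two-element set, and no tuple on the one-element side has the same equality pattern.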

In the following, we denote the semijoin game with initial configuration
$(A, a; B, b)$ and that consists of $m$ rounds, by $G_m(A, a; B, b)$.
We first state and prove

Proposition 10. If the duplicator wins $G_m(A, a; B, b)$, then for each semijoin
expression $E$ with at most $m$ nested semijoins and projections, we have
$a \in E(A) \Leftrightarrow b \in E(B)$.






Proof. We prove this by induction on $m$. The base case $m = 0$ is clear. Now
consider the case $m > 0$. Suppose that $a \in E_1 \ltimes_{\theta} E_2(A)$ but $b \notin E_1 \ltimes_{\theta} E_2(B)$.
Then $a \in E_1(A)$ and $\exists c \in E_2(A) : \theta(a, c)$, and either (*) $b \notin E_1(B)$ or
(**) $\neg\exists d \in E_2(B) : \theta(b, d)$. In situation (*), $a$ and $b$ are distinguished by an expression with
$m - 1$ semijoins or projections, so the spoiler has a winning strategy; in situation
(**), the spoiler has a winning strategy by choosing this $c \in E_2(A)$ with $\theta(a, c)$,
because each legal answer $d$ of the duplicator has $\theta(b, d)$ and therefore $d \notin E_2(B)$.
So, the spoiler now has a winning strategy in the game $G_{m-1}(A, c; B, d)$. In case
a projection distinguishes $a$ and $b$, a similar winning strategy for the spoiler
exists. In case $a$ and $b$ are distinguished by an expression that is neither a
semijoin, nor a projection, there is a simpler expression that distinguishes them,
so the result follows by structural induction. □

We now come to the main theorem of this section. This theorem concerns
the game $G_{\infty}(A, a; B, b)$, which we also abbreviate as $G(A, a; B, b)$. We say that
the duplicator wins $G(A, a; B, b)$ if the spoiler has no winning strategy. This
means that the duplicator can keep on playing forever, choosing legal answers
for every move of the spoiler.

Theorem 11. The duplicator wins $G(A, a; B, b)$ if and only if for each semijoin
expression $E$, we have $a \in E(A) \Leftrightarrow b \in E(B)$.

Proof. The 'only if' direction of the proof follows directly from Proposition 10,
because if the duplicator wins $G(A, a; B, b)$, he wins $G_m(A, a; B, b)$ for every
$m \geq 0$. So, $a$ and $b$ are indistinguishable through all semijoin expressions. For
the 'if' direction, it is sufficient to prove that if the duplicator loses, $a$ and $b$ are
distinguishable. We therefore construct, by induction, a semijoin expression $E_a^r$
such that (i) $a \in E_a^r(A)$, and (ii) $b \in E_a^r(B)$ iff the duplicator wins $G_r(A, a; B, b)$.
We define $E_a^0$ as

$$\sigma_{\theta_a}\Bigl( \bigcap_{R \in S} \ \bigcap_{\{X \subseteq Z \mid a \in \pi_X(A(R))\}} \pi_X(R) \ - \ \bigcup_{R \in S} \ \bigcup_{\{X \subseteq Z \mid a \notin \pi_X(A(R))\}} \pi_X(R) \Bigr)$$

In this expression, $Z$ is a shorthand for $\{1, \ldots, \mathrm{arity}(R)\}$ and $\theta_a$ is the atomic
type of $a$ over $\Omega$, i.e., the conjunction of all atomic and negated atomic formulas
over $\Omega$ that are true of $a$.

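We now construct $E_a^r$ in terms of $E_c^{r-1}$:

$$E_a^r := \bigcap_{c \in T_A} \bigl( E_a^0 \ltimes_{\theta_{a,c}} E_c^{r-1} \bigr) \ \cap \ \Bigl( E_a^0 - \bigcup_{j=1}^{s} \bigcup_{\theta} \bigl( E_a^0 \ltimes_{\theta} \bigcap_{\substack{c \in T_A \\ \theta(a,c)}} (E_c^{r-1})^{\mathrm{compl}} \bigr) \Bigr)$$

In this expression, $\theta_{a,c}$ is the atomic type of $a$ and $c$ over $\Omega$; $s$ is the maximal
arity of a relation in $S$; $\theta$ ranges over all atomic $\Omega$-types of two tuples, one with
the arity of $a$, and one with arity $j$. The notation $E^{\mathrm{compl}}$, for an expression $E$ of
arity $k$, is a shorthand for

$$\Bigl( \bigcup_{R \in S} \ \bigcup_{\substack{X \subseteq \{1, \ldots, \mathrm{arity}(R)\} \\ \#X = k}} \pi_X(R) \Bigr) - E$$

□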



5 Queries inexpressible in the semijoin algebra 

Grädel [8] already showed that transitivity is not expressible in the guarded
fragment. We will now show that transitivity is still inexpressible in the more 
powerful semijoin algebra. 

Theorem 12. Transitivity is inexpressible in the semijoin algebra. 

Proof. We will give two databases $A$ and $B$ over the schema $S$ containing a
single relation $R$, that are indistinguishable by semijoin expressions, and with
the property that $R$ is transitive in $A$ and not in $B$. These databases are shown
graphically in Figure 1.

Figure 1: Databases A (left) and B (right) that imply inexpressibility of transitivity
in the semijoin algebra

In this figure the edges represent the relation $R$. A moment's inspection reveals
that the duplicator has a winning strategy in the semijoin game $G(A, (); B, ())$.
For the sake of completeness we give here the formal strategy. We do this by
using the following bijections from tuple space $T_A$ to $T_B$.

[Tables defining the bijections $f : T_A \rightarrow T_B$ and $g : T_A \rightarrow T_B$ not reproduced here.]

When the spoiler makes his first move, the duplicator has a legal answer by
taking the image or pre-image of the spoiler's chosen tuple under bijection $f$.
The duplicator now continues answering each spoiler move by applying $f$ or $f^{-1}$
to the chosen tuple, until:

• in configuration $(A, ac; B, gl)$ the spoiler chooses $bc$ or $kl$, or

• in configuration $(A, bc; B, hi)$ the spoiler chooses $ac$ or $ji$, or

• in configuration $(A, df; B, ji)$ the spoiler chooses $ef$ or $hi$, or

• in configuration $(A, ef; B, kl)$ the spoiler chooses $df$ or $gl$.

In each of these cases, the duplicator answers with the tuple obtained from applying
$g$ or $g^{-1}$ to the chosen tuple, and from then on, he follows strategy function $g$.
Following $g$, he switches back to strategy function $f$ whenever:

• in configuration $(A, ac; B, ji)$ the spoiler chooses $bc$ or $hi$, or

• in configuration $(A, bc; B, kl)$ the spoiler chooses $ac$ or $gl$, or

• in configuration $(A, df; B, gl)$ the spoiler chooses $ef$ or $kl$, or

• in configuration $(A, ef; B, hi)$ the spoiler chooses $df$ or $ji$.

□ 

Another example of a query inexpressible in the semijoin algebra is the 
following: 

Theorem 13. The query "$R = \pi_1(R) \times \pi_2(R)$?" about a binary relation $R$ is
inexpressible in the semijoin algebra.

Proof. In Figure 2, two databases $A$ and $B$ are shown where $A$ satisfies the
query and $B$ does not. The duplicator has a winning strategy in the semijoin
game $G(A, (); B, ())$. □
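
To make the query of Theorem 13 concrete, here is a minimal Python sketch that evaluates it directly, outside the semijoin algebra. The function name `is_product` and the example relations are our own and are not the databases of Figure 2.

```python
def is_product(R):
    """Check whether a binary relation R equals pi_1(R) x pi_2(R)."""
    firsts = {a for (a, _) in R}
    seconds = {b for (_, b) in R}
    return R == {(a, b) for a in firsts for b in seconds}


# Hypothetical examples: a full product and a relation missing two pairs.
print(is_product({(1, 3), (1, 4), (2, 3), (2, 4)}))
print(is_product({(1, 3), (2, 4)}))
```

The check builds a cartesian product, which is exactly the kind of operation that the semijoin algebra lacks.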




Figure 2: Databases A (left) and B (right) that imply inexpressibility of the
query of Theorem 13



6 Impact of order 

In this section, we investigate the impact of order. On ordered databases (where
$\Omega$ now also contains a total order on the domain), the query that asks if there
are at least $k$ elements in a unary relation $S$ becomes expressible as $\mathrm{at\_least}(k)$,
which is inductively defined as follows:

$\mathrm{at\_least}(1) := S$

$\mathrm{at\_least}(k) := S \ltimes_{x_1 < y_1} (\mathrm{at\_least}(k - 1))$

Note that this query is independent of the chosen order. This parallels the 
situation in first-order logic, where there also exists an order-invariant query 
that is expressible with but inexpressible without order ([XJ Exercise 17.27] and 
Proposition 2.5.6]). 
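
The inductive definition of the at-least-$k$ query on ordered databases unrolls into a simple loop. The sketch below assumes $k \geq 1$, models the unary relation $S$ as a set of 1-tuples over comparable values, and uses function names of our own choosing.

```python
def semijoin(e1, e2, theta):
    """E1 semijoin_theta E2 on relations modelled as sets of tuples."""
    return {a for a in e1 if any(theta(a, b) for b in e2)}


def at_least(S, k):
    """at_least(1) = S; at_least(k) = S semijoin_{x1 < y1} at_least(k - 1)."""
    result = set(S)
    for _ in range(k - 1):
        result = semijoin(S, result, lambda a, b: a[0] < b[0])
    return result


def has_at_least(S, k):
    """The Boolean query: does S contain at least k elements?"""
    return bool(at_least(S, k))


S = {(3,), (1,), (7,)}
print(has_at_least(S, 3), has_at_least(S, 4))
```

Each iteration keeps the elements of $S$ that have a strictly larger element in the previous result, so $\mathrm{at\_least}(k)$ holds the elements with at least $k - 1$ larger elements in $S$. Which elements survive depends on the order, but whether the result is empty does not.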

Transitivity remains inexpressible in the semijoin algebra even on ordered
databases. Consider the following databases $A$ and $B$ over a single binary relation
$R$: $A(R)$ is the union of $X$, $Y$ and $Z$, where $X = \{1, \ldots, m\} \times \{2m + 1\}$,
$Y = \{2m + 1\} \times \{m + 1, \ldots, 2m\}$, and $Z = \{1, \ldots, m\} \times \{m + 1, \ldots, 2m\}$;
$B(R) = A(R) - \{(\frac{m+1}{2}, m + \frac{m+1}{2})\}$. Clearly, $R$ is transitive in $A$, but not in $B$.
We have shown elsewhere ^T] that when $m = 2n + 1$, the duplicator has a winning
strategy in the $n$-round semijoin game $G_n(A, (); B, ())$. By Proposition 10,
transitivity is not expressible in SA with order.
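
The two databases can be built and checked mechanically. In the sketch below, the function names are our own, and the tuple removed from $A(R)$ to obtain $B(R)$, namely $(\frac{m+1}{2}, m + \frac{m+1}{2})$, is our reading of a formula that is damaged in the source, so it should be treated as an assumption.

```python
def build_databases(n):
    """Return (A(R), B(R)) for m = 2n + 1, following the construction in the text."""
    m = 2 * n + 1
    top = 2 * m + 1
    low = range(1, m + 1)            # 1, ..., m
    high = range(m + 1, 2 * m + 1)   # m + 1, ..., 2m
    X = {(i, top) for i in low}
    Y = {(top, j) for j in high}
    Z = {(i, j) for i in low for j in high}
    A = X | Y | Z
    mid = (m + 1) // 2
    # Assumption: the removed tuple is ((m+1)/2, m + (m+1)/2).
    B = A - {(mid, m + mid)}
    return A, B


def is_transitive(R):
    """R is transitive iff (x, y) and (y, w) in R imply (x, w) in R."""
    return all((x, w) in R for (x, y) in R for (z, w) in R if y == z)


A, B = build_databases(1)
print(is_transitive(A), is_transitive(B))
```

Removing any single tuple of $Z$ breaks transitivity, because the path through $2m + 1$ from its first to its second component is still present in $X$ and $Y$.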